Business Case #5 - Retail - Clustering

Authors:

Débora Santos (m20200748), Pedro Henrique Medeiros (m20200742), Rebeca Pinheiro (m20201096)

Group D - D4B Consulting

Importing the file for clustering

Clustering

K-means - Clustering

To define the number of clusters, we plot three metrics:

1) The inertia plot shows the dispersion of the points within each cluster (the within-cluster sum of squared distances to the centroid), so a smaller inertia is better. Because inertia generally decreases as clusters are added, we look for the "elbow" where the improvement levels off.

2) The average silhouette plot shows how well each observation lies within its cluster; a higher value is better.

3) The Davies-Bouldin score plot shows an index based on the ratio between distances within the clusters and distances between clusters; a smaller index is better.
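The three metrics can be computed for a range of candidate cluster numbers along the following lines. This is a minimal sketch, not the notebook's original code: the data is a synthetic stand-in generated with `make_blobs`, and the helper name `kmeans_metrics` and the range of k values are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the (already scaled) customer features (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)


def kmeans_metrics(X, k_values, random_state=0):
    """Return {k: (inertia, average silhouette, Davies-Bouldin index)}."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        labels = model.labels_
        results[k] = (
            model.inertia_,                   # lower is better, look for the elbow
            silhouette_score(X, labels),      # higher is better, in [-1, 1]
            davies_bouldin_score(X, labels),  # lower is better, >= 0
        )
    return results


metrics = kmeans_metrics(X, range(2, 7))
for k, (inertia, sil, dbi) in metrics.items():
    print(f"k={k}: inertia={inertia:.1f}  silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
```

Plotting each of the three values against k gives the three plots described above.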

CLUSTERING BASED ON VALUES

Normalize data
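K-means relies on Euclidean distances, so variables on larger scales would otherwise dominate the solution. The notebook does not state which scaler was used; the sketch below assumes min-max scaling with scikit-learn's `MinMaxScaler` (`StandardScaler` would be a common alternative), and the feature values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer features on very different scales (assumption):
# columns = [total spend, number of purchases]
raw = np.array([[1200.0, 3.0], [150.0, 12.0], [640.0, 7.0], [90.0, 1.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)  # each column is rescaled to the range [0, 1]
print(scaled.round(3))
```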

We also ran the silhouette plot, which shows a coefficient for the quality of the clusters depending on the number of clusters chosen. We applied it to several candidate numbers of clusters.
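A silhouette plot is built from one silhouette coefficient per observation, grouped by cluster. The sketch below computes those coefficients and summarizes them per cluster for a few candidate values of k; the synthetic data and the helper name `silhouette_by_cluster` are assumptions for illustration, not the notebook's own code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the scaled customer features (assumption).
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.6, random_state=1)


def silhouette_by_cluster(X, k, random_state=0):
    """Fit K-means with k clusters and return each cluster's mean silhouette coefficient."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    coefs = silhouette_samples(X, labels)  # one coefficient per observation
    return {int(c): float(coefs[labels == c].mean()) for c in np.unique(labels)}


for k in (2, 3, 4):
    print(k, silhouette_by_cluster(X_demo, k))
```

Clusters whose mean coefficient is clearly below the others, or that contain many negative coefficients, suggest that the chosen k splits or merges natural groups.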

After looking at the plots, we decided to proceed with 3 clusters.

Evaluation

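The R² is a measure of the homogeneity of a cluster solution: the proportion of the total variance in the data that is explained by the clusters, so higher values indicate more homogeneous clusters.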
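The notebook does not show how the R² was computed. A minimal sketch, assuming the usual definition R² = 1 - SSW / SST (within-cluster sum of squares over total sum of squares), could look like this; the function name `r2_score_clusters` and the toy points are hypothetical.

```python
import numpy as np


def r2_score_clusters(X, labels):
    """R² of a cluster solution, assuming R² = 1 - SSW / SST."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Total sum of squares around the overall mean.
    sst = ((X - X.mean(axis=0)) ** 2).sum()
    # Within-cluster sum of squares around each cluster's own mean.
    ssw = sum(
        ((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )
    return 1.0 - ssw / sst


points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = r2_score_clusters(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = r2_score_clusters(points, [0, 1, 0, 1])   # each cluster mixes both groups
print(round(good, 3), round(bad, 3))
```

In this toy example the first labeling explains almost all of the variance, while the second explains almost none.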

Describe clusters - Visualizations

t-SNE (t-Distributed Stochastic Neighbor Embedding) is widely used to visualize high-dimensional data. It has many applications, but in this work our objective is to visualize the clusters within the data and how well the cluster solution fits them.
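As an illustration of this step, the sketch below projects a clustered data set to two dimensions with scikit-learn's `TSNE`. The data is synthetic, and the perplexity and random seed are assumptions chosen for the example rather than the values used in the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic 8-dimensional stand-in for the scaled customer features (assumption).
X_hd, _ = make_blobs(n_samples=150, centers=3, n_features=8, random_state=0)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_hd)

# Project to 2 dimensions; perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_hd)
print(embedding.shape)  # (150, 2)
```

A scatter plot of `embedding[:, 0]` against `embedding[:, 1]`, colored by `cluster_labels`, then shows whether the clusters form visually separate groups. Distances between groups in a t-SNE plot should be read qualitatively, not as exact distances in the original space.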

CLUSTERING BASED ON PREFERENCES

CLUSTERING BY PRODUCT CATEGORY

We also ran the silhouette plot, which shows a coefficient for the quality of the clusters depending on the number of clusters chosen. We applied it to several candidate numbers of clusters.

After looking at the plots, we decided to proceed with 3 clusters.

Evaluation

The R² is a measure of the homogeneity of a cluster solution.

Describe clusters - Visualizations

t-SNE (t-Distributed Stochastic Neighbor Embedding) is widely used to visualize high-dimensional data. It has many applications, but in this work our objective is to visualize the clusters within the data and how well the cluster solution fits them.

CLUSTERING BY PRODUCT FAMILY

We also ran the silhouette plot, which shows a coefficient for the quality of the clusters depending on the number of clusters chosen. We applied it to several candidate numbers of clusters.

After looking at the plots, we decided to proceed with 3 clusters.

Evaluation

The R² is a measure of the homogeneity of a cluster solution.

Describe clusters - Visualizations

t-SNE (t-Distributed Stochastic Neighbor Embedding) is widely used to visualize high-dimensional data. It has many applications, but in this work our objective is to visualize the clusters within the data and how well the cluster solution fits them.

After running the three clustering attempts, we decided to keep only the clustering based on values, which presented the best results in terms of both metrics and visualizations.

We also had the opportunity to analyze the results in Power BI to compare the solutions. The clustering based on categories does not present good results because a) category 178 is so representative overall that it does not help to separate preferences, and b) the other categories are too widely distributed, which generates sparsity. The clustering based on family presented some good results, but it is worse than the clustering based on values. Comparing the results in Power BI, we also found that, in terms of product preferences, the clustering based on values is very similar to the clustering based on family.